Any webpage you visit has a particular, expected general structure. It usually consists of two types of code.
- HTML code, which focuses on the appearance and format of a web page.
- XML code, which doesn't look much different from HTML but focuses more on managing data in a web page.

HTML code

HTML code has an expected format and structure, which makes it easy for people to develop web pages. Here is an example of a simple HTML page:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
As you can see, the content is wrapped in tags like <head></head>, <body></body> and <p></p>. These tags are pre-defined by the language (you can only use the tags that HTML allows). Because HTML has a predictable structure, it is often easier to work with and mine.
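As a quick preview of what that predictability buys us, here is a minimal sketch that parses the example page above and pulls out the heading and paragraph. It uses the rvest package, which we cover properly later in this post:

```r
library(rvest)

# Parse the simple HTML page from the example above
page <- read_html("<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>")

# Because <h1> and <p> are pre-defined tags, we can ask for them directly
page %>% html_node("h1") %>% html_text()  # "This is a Heading"
page %>% html_node("p") %>% html_text()   # "This is a paragraph."
```

Because every HTML page uses the same fixed tag vocabulary, these two lines would work unchanged on any page with a heading and a paragraph.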
XML code

XML format and structure is less predictable. Although it looks very similar to HTML, users can create their own named tags. Here is an example:
<note>
<to>Keith</to>
<from>Kevin Sneader</from>
<heading>Kudos</heading>
<body>Awesome work, dude!</body>
</note>
Tags like <to></to> and <from></from> are completely made up by me. The fact that tags are not pre-defined makes XML a little harder to mine and analyze. But it’s hard to get at some of the data on the web without using XML.
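To see how made-up tags are handled in practice, here is a minimal sketch using the xml2 package (also covered below) to parse the note above. Unlike the HTML case, we have to know or discover the tag names ourselves before we can extract anything:

```r
library(xml2)

# Parse the XML note from the example above
note <- read_xml("<note>
<to>Keith</to>
<from>Kevin Sneader</from>
<heading>Kudos</heading>
<body>Awesome work, dude!</body>
</note>")

# XPath queries locate the author-invented tags by name
xml_text(xml_find_first(note, "//to"))    # "Keith"
xml_text(xml_find_first(note, "//from"))  # "Kevin Sneader"
```

If the document's author had named the tags <recipient> and <sender> instead, the same code would return nothing, which is exactly why XML is a little harder to mine.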
To mine web data, it’s important that you can see the underlying code and understand how it relates to what you are seeing on the page. The best way to do this (in my opinion) is to use the Developer Tools that come with Google Chrome.
When you are viewing a web page in Chrome, simply use Ctrl+Shift+C on Windows or Cmd+Option+C on a Mac to open up the Elements panel, where you can see all the code underlying the page. This can look really complex, but don't worry. Here's a photo of Google Chrome Developer open on the Billboard Hot 100 page:
If you play around with the code in the Developer you will see that it has an embedded structure.
- Everything is wrapped in an <html> tag.
- Inside that, the page is split into <head> and <body> tags.
- Within the <body> of the page, different elements are often separated by <div> tags.

This is important because it means we can mine elements of a web page and treat them like lists in R. We often call a specific element of the page a node. If we want to mine a specific node, we can capture its sub-nodes in a list, which gives us the opportunity to apply the tidyverse when mining web pages. The process of mining data from the web is called scraping or harvesting.
rvest and xml2 packages

The rvest and xml2 packages were designed to make it easier for people working in R to harvest web data. Since xml2 is a required package for rvest and the two are designed to work together, you only need to install rvest. First, let's ensure the packages we need are installed and loaded:
if (!("rvest" %in% installed.packages())) {
install.packages("rvest")
}
if (!("dplyr" %in% installed.packages())) {
install.packages("dplyr")
}
library(rvest)
## Loading required package: xml2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
rvest and xml2 contain functions that allow us to read the code of a web page, break it into a neat structure, and work with the pipe operator to efficiently find and extract specific pieces of information. Think of it a bit like performing keyhole surgery on a webpage. Once you understand what functions are available and what they do, basic web scraping becomes very easy and can produce really powerful results.
We are going to use the example of mining the Billboard Hot 100 page at https://www.billboard.com/charts/hot-100. If you view this page, it's pretty bling: there are videos popping up and images all over the place. But the basic point of the page is to show the current Hot 100 chart.
So let's set ourselves the task of just harvesting the basic info from this page: position number, artist and song title for each Hot 100 entry.
First we load our packages and then we use the function read_html() to capture the HTML code of the Billboard Hot 100 page.
hot100page <- "https://www.billboard.com/charts/hot-100"
hot100 <- read_html(hot100page)
hot100
## {xml_document}
## <html class="" lang="">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="chart-page chart-page- " data-trackcategory="Charts-The ...
str(hot100)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
The function has captured the entire content of the page in the form of a special list-type document with two nodes, <head> and <body>. We are almost always interested in the body of a web page. You can select a node using html_node() and then see its child nodes using html_children().
body_nodes <- hot100 %>%
html_node("body") %>%
html_children()
body_nodes
## {xml_nodeset (20)}
## [1] <div class="header-wrapper ">\n<header id="site-header" class="site ...
## [2] <div class="site-header__placeholder"></div>
## [3] <main id="main" class="page-content"><div class="chart-detail-heade ...
## [4] <div class="ad_desktop dfp-ad" data-position="promo" data-sizes="[[ ...
## [5] <footer id="site-footer" class="site-footer"><div class="container ...
## [6] <script>\n window.CLARITY = window.CLARITY || [];\n</script>
## [7] <div class="ad_clarity" data-out-of-page="true" style="display: non ...
## [8] <script>\n CLARITY.push({\n use: ['ads', 'cookies', 'head ...
## [9] <script type="text/javascript" src="https://assets.billboard.com/as ...
## [10] <script type="text/javascript" src="https://assets.billboard.com/as ...
## [11] <script type="text/javascript" src="https://assets.billboard.com/as ...
## [12] <script type="text/javascript" src="https://assets.billboard.com/as ...
## [13] <script type="text/javascript" src="https://assets.billboard.com/as ...
## [14] <script type="text/javascript">\n\tvar _sf_async_config={};\n\t/** ...
## [15] <script class="kxct" data-id="JsVUOKRj" data-timing="async" data-ve ...
## [16] <script class="kxint" type="text/javascript">\n window.Krux||((K ...
## [17] <script data-src="//platform.instagram.com/en_US/embeds.js"></script>
## [18] <script data-src="//platform.twitter.com/widgets.js"></script>
## [19] <div id="fb-root"></div>
## [20] <script type="text/javascript">\n PGM.createScriptTag("//connect ...
If we want to go one level deeper and see the nodes inside these nodes, we can just continue to pipe further into the code:
body_nodes %>%
html_children()
## {xml_nodeset (12)}
## [1] <header id="site-header" class="site-header " role="banner"><div cl ...
## [2] <div class="header-wrapper__secondary-header">\n<nav class="site-he ...
## [3] <div class="chart-detail-header">\n<div class="chart-detail-header_ ...
## [4] <div class="ad-container leaderboard leaderboard--top">\n<div class ...
## [5] <div class="container chart-container container--xxlight-grey conta ...
## [6] <div class="chart-list__expanded-header">\n<div class="chart-list__ ...
## [7] <div id="dateSearchModal" class="date-search-modal" data-visible="f ...
## [8] <div class="ad-holder ad-holder--footer">\n<div class="ad_desktop d ...
## [9] <div class="container footer-content">\n<div class="cover-image">\n ...
## [10] <div class="container">\n<p class="copyright__paragraph">© 2019 Bil ...
## [11] <div class="container">\n<p class="station-identification">\nBillbo ...
## [12] <div class="container">\n<div class="ad_desktop dfp-ad dfp-ad-adhes ...
We could mess around with the functions above for a long time, but we might still find it hard to work out exactly where the chart data sits. This is where Chrome Developer Tools can tell us where to find the data in the code; then we can use rvest to harvest it out.
If you hover your mouse over the code in the Developer Tools, you will see that the elements of the page that the code refers to are highlighted in the browser. You can click to expand embedded nodes to get to more specific parts of the page. Watch this video to see how I progressively drill down the code to find the precise nodes that contain the details of each chart entry.
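To give a flavour of where this drilling down leads, here is a sketch of the final harvesting step. The HTML snippet is a simplified mock of a single chart entry, and the CSS class names are hypothetical stand-ins for whatever classes Developer Tools reveals on the real page (Billboard changes its markup from time to time):

```r
library(rvest)
library(dplyr)

# A simplified mock of one chart entry; the real class names will differ
chart_html <- read_html('<div class="chart-item">
  <span class="chart-item__rank">1</span>
  <span class="chart-item__title">Some Song</span>
  <span class="chart-item__artist">Some Artist</span>
</div>')

# Once you know the right selectors, harvesting is a short pipeline:
# html_nodes() grabs every matching node, html_text() extracts its text
chart <- tibble(
  rank   = chart_html %>% html_nodes(".chart-item__rank")   %>% html_text(),
  title  = chart_html %>% html_nodes(".chart-item__title")  %>% html_text(),
  artist = chart_html %>% html_nodes(".chart-item__artist") %>% html_text()
)
chart
```

On the live page the same pipeline, pointed at the real selectors, returns all 100 entries at once, because html_nodes() collects every node matching the selector.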